Multi-parametric Solution-path Algorithm for 
Instance-weighted Support Vector Machines 



Masayuki Karasuyama 
Naoyuki Harada 

Department of Engineering 
Nagoya Institute of Technology 
Nagoya, Aichi 466-8555, Japan 

Masashi Sugiyama 

Department of Computer Science 
Tokyo Institute of Technology 
Tokyo 152-8552 Japan 

Ichiro Takeuchi 

Department of Engineering 
Nagoya Institute of Technology 
Nagoya, Aichi 466-8555, Japan 



krsym@goat. ics. nitech.ac.jp 
harada@goat.ics.nitech.ac.jp 



sugiOcs.titech.ac.jp 



takeuchi. ichiroonitech. ac.jp 



Editor: 



Abstract 



An instance-weighted variant of the support vector machine (SVM) has attracted consid- 
erable attention recently since they are useful in various machine learning tasks such as 
non-stationary data analysis, heteroscedastic data modeling, transfer learning, learning to 
rank, and transduction. An important challenge in these scenarios is to overcome the com- 
putational bottleneck — instance weights often change dynamically or adaptively, and thus 
the weighted SVM solutions must be repeatedly computed. In this paper, we develop an 
algorithm that can efficiently and exactly update the weighted SVM solutions for arbitrary 
change of instance weights. Technically, this contribution can be regarded as an extension 
of the conventional solution-path algorithm for a single regularization parameter to multi- 
ple instance-weight parameters. However, this extension gives rise to a significant problem 
that breakpoints (at which the solution path turns) have to be identified in high-dimensional 
space. To facilitate this, we introduce a parametric representation of instance weights. We 
also provide a geometric interpretation in weight space using a notion of critical region: 
a polyhedron in which the current affine solution remains to be optimal. Then we find 
breakpoints at intersections of the solution path and boundaries of polyhedrons. Through 
extensive experiments on various practical applications, we demonstrate the usefulness of 
the proposed algorithm. 
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1. Introduction 



The most fundamental principle of machine learning would be the empirical risk minimiza- 
tion, i.e., the sum of empirical losses over training instances is minimized: 



mm 



5> 



where Li denotes the empirical loss for the i-th training instance. This empirical risk min- 
imization approach was proved to produce consistent estimators ( Vapnik . 19951 ). On the 
other hand, one may also consider an instance-weighted variant of empirical risk minimiza- 
tion: 



mm 



y CiLi 



where Cj denotes the weight for the i-th training instance. This weighted variant plays an 
important role in various machine learning tasks: 

• Non-stationary data analysis: When training instances are provided in a sequen- 
tial manner under changing environment, smaller weights are o f ten as s igned to older 
insta nces for imposing some 'forgetting' effect (jMurata et all 120021 : ICao and Tayi . 
20031 ) . 



Heteroscedastic data modeling: A supervised learning setup where the noise 
level in output values depends on input points is said to be heteroscedastic. In het- 
eroscedastic da ta modeling, larger w eights are often assigned to instances with smaller 
noise variance (|Kersting et alJ . l2007l ). The traditional Gauss-Markov theorem (lAlbertl . 
19721 ) forms the basis of this idea. 



• Covariate shift adaptation, transfer learning, and multi-task learning: A 

supervised learning situation where training and test inputs follow different distribu- 
tions is called covariate shift. Under covariate shift, using the importance (the ratio 
of the test an d training input d ensities) as instance weights assures the consistency 
of estimators (|Shimodairal . l2000l ). Similar importance-weighting ideas can be applied 
also to transfer learnin g (where data in one domain is transferred to another domain) 
( Jiang and Zhai . 20071 ) and multi-task learning (wher e multiple learning problems are 



solved simultaneously by sharing training instances) ( Bickel et al. . 20081 ) 



Learning to rank and ordinal regression: The goal of rankin g (a.k.a. ordina l 



regre s sion ) is to give an ordered list of items based on their relevance (jHerbrich et al 



200d: Eiul . [2009l ). In practical ranking tasks such as information retrieval, users are 



often not interested in the entire ranking list, but only in the top few items. In order 
to improve the prediction accur acy in th e top of the list, larger weights are often 
assigned to higher-ranked items ( Xu et al. . 20061 ). 



Transduction and semi-supervised learning: Transduction is a supervised learn- 
ing setup where the goal is not to learn the entire input-output ma pping, but onl y to 
estimate the output values for pre-specified unlabeled input points (jVapnikl . Il995l ). A 
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popular approach to transduction is to label the unlabeled samples using t he current 
estimator, and then mod ify the estimator using the 'self-labeled' samples (jJoachimsl . 
1999; iRaina et all 120071 ). In this procedure, smaller weights are usually assigned to 



the self-labeled samples than the originally-labeled samples due to their high uncer- 
tainty. 

A common challenge in the research of instance-weighted learning has been to overcome 
the computational issue. In many of these tasks, instance weights often change dynami- 
cally or adaptively, and thus the instance-weighted solutions must be repeatedly computed. 
For example, in on-line learning, every time when a new instance is observed, all the in- 
stance weights must be updated in such a way that newer instances have larger weights and 
older instances have smaller weights. Model selection in instance-weighted learning also 
poses a considerable computational burden. In many of the above scenarios, we only have 
qualitative knowledge about instance weights. For example, in the aforementioned ranking 
problem, we only know that higher-ranked items should have larger weights than lower- 
ranked items, but it is often difficult to know how large or small these weights should be. 
The problem of selecting the optimal weighting patterns is an instance of model selection, 
and many instance-weighted solutions with various weighting patterns must be computed 
in the model selection phase. The goal of this paper is to alleviate the computational 
bottleneck of instance-weighted learning. 



In this paper, we fo cus on the support vector machine (SVM) ( Boser et al. . 19921 ; 



Cortes and Vapnik . 19951 ). which is a popular classification algorithm minimizing a reg- 
ularized empirical risk: 



mm 



where R is a regularization term and C > controls the trade-off between the regularization 
effect and the empirical risk minimization. We consider an instan c e- weighted variant of 



SVM , which we refer t o as the weighted SVM (WSVM) (jLin et al 
20021 : lYang et all 120071 ) : 



20021 : iLin and Wand . 



mm 



For ordinary SVM, the solution path algorithm was proposed (jHastie et all 120041 ). 



which allows efficient computation of SVM solutions for all C by utilizing the piecewise- 
linear structure of the solutions w.r.t. C. This te c hnique is kn o wn as parametric program- 
ming in the optimization commu nity (jBestJ, 11982 : iRitterl . Il984l : lAllgower and Geord" . 1 199.4 
Bennett and B rcdcn steinerl. 119971) . and has been applied to variou s machine learning tasks 



2007 : iRosset and Zhul 



20081 : lArreola et al 



2008 



recen t ly (IFine and Scheinberd . 12002: IZhu et al 1. 120041: iBach et all 120061: iGunter and Zhu , 



2007: Lee and Scott 



Takeuchi et al 



2009 



Kanamori et al .1 . |2009j); the incremental- 



20071: ISiostrand et all 20071: IWang et al 



decremental SVM algorithm, which efficiently follows the piecewise-linear solution path 
when some training instances are added or r emoved from the training se t , is al s o based on the 



same parametric programming techniq ue (jCauwenberghs and Poggid . 120011 : lLaskov et al 
2006 : Karasuvama and Takeuchi . 20091 ). 
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The solution path algorithms described above have been developed for problems with a 
single hyper-parameter. Recently, attention has been paid to studying solution- path track- 
ing in two-dimensional hyper-parameter space. For example, Wang et al. ( 20081 ) developed 
a path-following algorithm for regularization parameter C and an insen s itive zone thickness 
e in s u pport vector r egression ( Vapnik et al. . 1996 : Mattera and Haykin . 19991 : Miiller et al 



1999h. iRossetl (120091 1 studied a path-following algori thm for regularization parameter A and 



quantile parameter r in kernel quantile regression (jTakeuchi et all l200fil ). However, these 
works are highly specialized to specific problem structure of bivariate path-following, and 
it is not straightforward to extend them to more than two hyper-parameters. Thus, the 
existing approaches may not be applicable to path- following of WSVM, which contains 
n-dimensional instance- weight parameters c = [C±, . . . , C n ] T , where n is the number of 
training instances. 

In order to go beyond the limitation of the existing approaches, we derive a general solu- 
tion path algorithm for efficiently computing the solution path of multiple instance-weight 
parameters c in WSVM. This extension involves a significant problem that breakpoints (at 
which the solution path turns) have to be identified in high-dimensional space. To facili- 
tate this, we introduce a parametric representation of instance weights. We also provide a 
geometric interpretation in weight space using a not i on of critical region from t he stu dies 
of multi-parametric programming (|Gal and Nedomal . 1 1971 iPistikopoulos et al.l . 120071 ). A 
critical region is a polyhedron in which the current affine solution remains to be optimal 
(see Figure [2]) . This enables us to find breakpoints at intersections of the solution path and 
the boundaries of polyhedrons. 

This paper is organized as follows. Section [2] reviews the definition of WSVM and its op- 
timality conditions. Then we derive the path-following algorithm for WSVM in Section [3j 
Section H] is devoted to experimentally illustrating advantages of our algorithm on a toy 
problem, on-line time-series analysis and covariate shift adaptation. Extensions to regres- 
sion, ranking, and transduction scenarios are discussed in Section [5j Finally, we conclude 
in Section [6l 



2. Problem Formulation 

In this section, we review the definition of the weighted support vector machine (WSVM) 
and its optimality conditions. For the moment, we focus on binary classification scenarios. 
Later in Section [5j we extend our discussion to more general scenarios such as regression, 
ranking, and transduction. 



2.1 WSVM 

Let us consider a binary classification problem. Denote n training instances as {(xi, yi)}f=i> 
where xi £ X CMP is t he input and Vi g j— 1, + 1 } is t he output label. 



SVM (jBoser et al.l . [1992J; ICortes and Vapnik! . Il995l ) is a learning algorithm of a linear 
decision boundary 

/(as) = w T $(x) + b 

in a feature space J 7 , where <1? : X — > T is a map from the input space X to the feature 
space J 7 , w £ J- is a coefficient vector, b € IR is a bias term, and T denotes the transpose. 
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The parameters w and b are learned as 

1 n 

i=l 

where ||| is the regularization term, || • || denotes the Euclidean norm, C is the trade-off 
parameter, and 

[z] + = max{0, z}. 

[1 — yif{xi)] + is the so-called hinge-loss for the i-th training instance. 

WSVM i s an extension o f the ordinary SVM so that each training instance possesses its 
own weight (|Lin et all . 120021 : iLin and Wan"l 120021 : lYang et all 120071 ): 

1 n 

mm -\\w\\l + J2Ci[l-y i f{x l )] + , (2) 



i=i 



where Cj is the weight for the i-th training instance. WSVM includes the ordinary SVM 
as a special case when Ci = C for i = 1, . . . ,n. The primal optimization problem ([2]) is 
expressed as the following quadratic program: 



WM&U 2" 112 ^ (3) 
s.t. Vifixi) > 1 - & > 0, i = 1, . . . , n. 

The goal of this paper is to derive an algorithm that can efficiently compute the sequence 
of WSVM solutions for arbitrary weighting patterns of c = \C\, . . . , C n ] T . 

2.2 Optimization in WSVM 

Here we review basic optimization issues of WSVM which are used in the following section. 
Introducing Lagrange multipliers on > and pi > 0, we can write the Lagrangian of ([3]) 

as 

^ n n n 

j=l i=l i=l 

Setting the derivatives of the above Lagrangian w.r.t. the primal variables w, b, and to 
zero, we obtain 



at, 

ac 
ac 



4^ w = J2aiyi<$>(xi), 

i=l 

n 

: Yl aiVi = °' 

44> aj = Ci-pi, i = 1, ...,n, 
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where denotes the vector with all zeros. Substituting these equations into ([!]), we arrive 
at the following dual problem: 



^ n n n 



r-1 3 -l l -l 

n y ' 

s.t. ^yiOi = 0, < ai < Ci, 

i=l 



where 



Qij — yiVjK{Xi,Xj), 

and K(xi,Xj) = &(xi) T <fr(xj) is a reproducing kernel (Aronszajn, 1950l ). The discriminant 
function / : X — )■ R is represented in the following form: 

n 

f(x) = ^ otiViKix, Xi) + b. 

i=l 

The optimality conditions of the dual proble m ([5]) , called the Karush-Kuhn- Tucker 
(KKT) conditions ( Boyd and Vandenberghel . 20041 ) . are summarized as follows: 

Vif(xi) > 1, if cti = 0, (6a) 
Vifixi) = 1, if < cti < Q, (6b) 

Vif(xi) < 1, if cti = Ci, (6c) 

n 

Vi a i = 0- ( 6d ) 

i=i 

We define the following three index sets for later use: 

Q = {i\a t = 0}, (7a) 
M = {i | < oh < C,}, (7b) 
1 = {i | cti = Ci}, (7c) 

where O, Ai, and I stand for 'Outside the margin' (yif(xi) > 1), 'on the Margin' (yif(xi) = 
1), and 'Inside the margin' (yif(xi) < 1), respectively (see Figure [T]). 

In what follows, the subscript by an index set such as vx for a vector v S M n indicates 
a sub-vector of v whose elements are indexed by X. For example, for v = (a, b, c) T and 
1 = {1,3}, vx = (a, c) T . Similarly, the subscript by two index sets such as Mj^^q fo r a 
matrix M € R nxn denotes a sub-matrix whose rows and columns are indexed by A4 and 
O, respectively. The principal sub-matrix such as M_m^_m is abbreviated as Mw 



3. Solution-Path Algorithm for WSVM 

The path- following algorithm for the ordinary SVM ( Hastie et al. . 20041 ) computes the entire 
solution path for the single regularization parameter C. In this section, we develop a path- 
following algorithm for the vector of weights c = [C±, . . . ,C n ] T . Our proposed algorithm 
keeps track of the optimal ai and b when the weight vector c is changed. 
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Figure 1: The partitioning of the data points in SVM. 



3.1 Analytic Expression of WSVM Solutions 

Let 









yi 




Qu ■ 


Qln 


a = 




, y = 




, and Q = 
















Qnl ■ 


Qnn 



Then, using the index sets (|7bj) and (jTcj) . we can expand one of the KKT conditions, (|6b[) . 



as 



Q M a M + Q MjX ci + VM b = 



(8) 



where 1 denotes the vector with all ones. Similarly, another KKT condition (|6d|) is expressed 
as 



y T M a M + Vx c x = o. 



0) 



Let 



M 



y T M 
Vm Qm 



Then ([8]) and ([9]) can be compactly expressed as the following system of \A4\ + 1 linear 
equations, where \A4\ denotes the number of elements in the set A4: 



M 



b 
a M 



+ 



Q 



MX. 



Solving (jlOjl w.r.t. b and ctMi we obtain 

b 

olm 



Q 



MX 



c x + M' 1 



(10) 



(11) 



where we implicitly assumed that M is invertibldE Since b and olm are a ffi ne w.r.t. cj, 
we can calculate the change of b and olm by (|HP as long as the weight vector c is changed 

1. The invertibility of the matrix M is assured if and only if the submatrix Q M is positive definite in the 
subspace {z £ R'^ 1 | y^z = 0}. We assume this technical condition here. A notable exceptional case 
is that M is empty — we will discuss how to cope with this case in detail in Section 13.31 
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continuously. By the definition of X and O, the remaining parameters ax and olq are 
merely given by 

olx = c x , (12) 
a = 0. (13) 

A change of the index sets Ai, O, and X is called an event. As long as no event occurs, 
the WSVM solutions for all c can be computed by (fTT]) -(fl3|) since all the KKT conditions 
(|6ap - (|6dp are still satisfied. However, when an event occurs, we need to check the violation 
of the KKT conditions. Below, we address the issue of event detection when c is changed. 



3.2 Event Detection 



Suppose we want to change the weight vector from c( old ) to c( ncw ) (see Figure [2|). This can 
be achieved by moving the weight vector c( old ) toward the direction of c( ncw ) — c( old ) . 

Let us write the line segment between c( old ) and c( new ) in the following parametric form 



c{9) 



c (old) + 



,(new) _ _(old) 



, 9 € [0, 1], 



where 9 is a parameter. This parametrization allows us to derive a path- following algorithm 
between arbitrary c( old ) and c( ncw ) by considering the change of the solutions when 9 is 
moved from to 1. Suppose we are currently at c(9) on the path, and the current solution 
is (6, a). Let 



Ac = A9 [ c( new 



>id) 



A9 > 0, 



(14) 



(15) 



where the operator A represents the amount of change of each variable from the current 
value. If A9 is increased from 0, we may encounter a point at which some of the KKT 
conditions (|6a|) — (f6c|) do not hold. This can be checked by investigating the following condi- 
tions. 

Vif(xi) + ViAf(xi) > 1, ieO, 
on + Aaj > 0, i e M, 
ai + Aa { - (d + Ad,) < 0, i € M., 
Vif(xi) + yiAf(xi) < 1, i EX. 

The set of inequalities f)15j) defines a convex polyhedron, called a critical region in the multi- 
parametric programming literature ( Pistikopoulos et al. . 20071 ). The event points lie on the 
border of critical regions, as illustrated in Figure [2j 

We detect an event point by checking the conditions f)15|) along the solution path as 
follows. Using (fTT|) . we can express the changes of b and as 



A6 
Aa M 



where 







M 



Q 



M,X 



A9cf>, 



(16) 



(17) 
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Figure 2: The schematic illustration of path- following in the space of c € R 2 , where the 
WSVM solution is updated from c( old ) to c ( ncw ). Suppose we are currently at c{6). 
The vector d represents the update direction c( ncw ) — c^ old \ and the polygonal 
region enclosed by dashed lines indicates the current critical region. Although 
c(6) + A# max <i seems to directly lead the solution to c^ ncw \ the maximum possible 
update from c{6) is A9d; otherwise the KKT conditions are violated. To go 
beyond the border of the critical region, we need to update the index sets M, X, 
and O to fulfill the KKT conditions. 



Furthermore, yiAf{xi) is expressed as 

yiAf(xi) = [ yi Q iM 



Ab 



+ QirAcx 



= A9iPi, 

where 

ipi= [yi QiM ] <t> + Qi,x( c i° w) - c 

Let us denote the elements of the index set M. as 

M = {mi, . . .,m\ M \} 



(old), 
X > 



(18) 
(19) 



Substituting (|16p and (|18|) into the inequalities (|15p , we can obtain the maximum step-length 
with no event occurrence as 



AO 



mm 

i€{l,...,\M\},jell)0 



a,. 



(20) 



where fa denotes the i-th element of (f> and di = C, 



(new) 



— (j( old \ \Ye used minj{zj} + as a 
simplified notation of minjjzj | Zi > 0}. Based on this largest possible A9, we can compute 
ex and b along the solution path by (|16l) . 

At the border of the critical region, we need to update the index sets M, O, and I. For 
example, if on (i € M) reaches 0, we need to move the element i from A4 to O. Then the 
above path-following procedure is carried out again for the next critical region specified by 
the updated index sets M, O, and X, and this procedure is repeated until c reaches c^ ncw \ 
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3.3 Empty Margin 

In the above derivation, we have implicitly assumed that the index set Ai is not empty- 
when Ai is empty, we can not use (|16|) because M^ 1 does not exist. 
When Ai is empty, the KKT conditions © can be re-written as 



^2QijCj+yib>l, ieO, 
iex 

^2q,jC, ■ /lib < l. iel, 

$>*Ci = o. 



(21a) 
(21b) 
(21c) 



iex 



Although we can not determine the value of b uniquely only from the above conditions, 
(|21aj) and (|21b|) specify the range of optimal b: 



where 



Let 



where 



maxy^ < b < miny^j, 



9i — 1 ~~ QijCj, 

iex 

£ = {i | i € O, yi = 1} U {i | i € X, Vi = -1}, 
U = {i\iGO, yi = -l}U{i\iel, yi = 1}. 



<5 = ^Vidi, 



d- = (7( ncw ) _ £<( old ) 



(22) 



When 5 = 0, the step size AO can be increased as long as the inequality (|22l) is satisfied. 
Violation of (|22[) can be checked by monitoring the upper and lower bounds of the bias b 
(which are piecewise-linear w.r.t. AO) when AO is increased 



where 



u(A0) = max ieU Vi(gi + Ag^AO)), 
&{A0) = min i6£ yfa + A 9i (A0)), 



A 9l (A0) = -AOY^Qijdj- 

iex 



(23) 



On the other hand, when 5^0, AO can not be increased without violating the equality 
condition (|2 lc|) . In this case, an instance with index 



i\ ow = argmax y^ 
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or 



i up = argmm 

actually enters the index set M.. If the instance (we denote its index by m) comes from the 
index set O, the following equation must be satisfied for keeping (|21cj) satisfied: 

A95 = -Aa m y m . 

Since AO > and Aa m > 0, we have 

sign(<5) = sign(-y m ). 

On the other hand, if the instance comes from the index set I, 

A95 = y m (AC m - Aa m ) 

must be satisfied. Since AO > and AC m — Aa m > 0, we have 

sign(<5) = sign(y m ). 

Considering these conditions, we arrive at the following updating rules for b and M: 



6>0 b = y lup g tup , M = {i up }, 
5<0 b = y How g ilow , M = {ii ow }. 

Note that we also need to remove i up and zi ow from O and X, respectively. 



(24) 



3.4 Computational Complexity 

The entire pseudo-code of the proposed WSVM path-following algorithm is described in 
Figure El 

The computational complexity at each iteration of our path-following algorithm is 
the s ame as that for the ordinary SVM (i.e., the single-C formulation) (IHastie et all 
20041 ) . Thus, our algorithm inherits a superior computational property of the original 
path-following algorithm. 

The update of the linear system (j!7p from the previous one at each event point can be 
carried out efficiently with Q(|A^| 2 ) computational cost based on the Cholesky decompo- 



sition rank- on e update ( Golub and Van Loan . 19961 ) or the block-matrix inversion formula 
(jSchottl . [20051 k Thus, the computational cost required for identifying the next event point 
isO(n\M\). 

It is difficult to state the number of iterations needed for complete path-following because 
the number of events depends on the sensitivity of the model and the data set. Several 
empirical results su g gest that the number of events linear l y incr eases w.r.t. the data set size 



(jHastie et all I2004J : iGunter and Zhul . 120071 : IWang et all 120081 ) ; our experimental analysis 



given in Section H] also showed the same tendency. This implies that path-following is 
computationally highly efficient — indeed, in Section HI we will experimentally demonstrate 
that the proposed path-following algorithm is faster than an alternative approach in one or 
two orders of magnitude. 
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1: arguments: 

2: Optimal parameters a and b for c^ old ^ 

3: Sets A4, O, I, and Cholesky factor L of Q M 

4: New weight vector c( ncw ) 

5: end arguments 

6: function WSVM-Path(o:, b, c^- old \M, 0,1, L, c( ncw )) 

7: 9^0, c^c(° ld ) 

8: while 9 7^ 1 do 

9: if A4 is empty then 

10: A9 «- EmptyMargin 

11: else 

12: Calculate 4> by CZI) using Cholesky factor L 

13: Calculate ib bv (fT9|) 

14: Calculate A(9 by ((20]) 

15: end if 

16: If 6» + A0 > 1, then A9 <- 1 - 6 

17: Update a, 6, and c by step length A6 

18: «- + AO 

19: Update A4, O, and Z depending on the event type 

20: Update L (Cholesky factor rank-one update) 

21: end while 

22: end function 

23: function EmptyMargin 

24: if 5{a) / then 

25: Set bias term b by ([MD 

26: A9 <- 

27: else 

28: Trace u(A9) and £(A6>) in ([23]) until u(A0) = £(A9) 

29: end if 

30: return AO 

31: end function 



Figure 3: Pseudo-code of the proposed WSVM path-following algorithm. 



4. Experiments 

In this section, we illustrate the empirical performance of the proposed WSVM path- 
following algorithm in a toy example and two real- world applications. We compared the 
computational cost of the propose d path-follow ing; algorithm with the sequential mini- 
mal optimization (SMO) algorithm (Piatt, 19991 ) when the instance weights of WSVM are 
changed in various ways. In particular, we investigated the CPU time of updating solutions 
from some c( old ) to c ( new ). 
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In the path-following algorithm, we assume that the optimal parameter a as well as the 
Cholesky factor L of Qm for c( old ) has already been obtained. In the SMO algorithm, we 
used the old optimal parameter a as the initial sta rting point (i.e., the 'hot' sta rt) after 
making them feasible using the alpha-seeding strategy toeCoste and WagstaflM ) . We set 
the tolerance parameter in the termination criterion of SMO to 10 ~ 3 . Our implementation 
of the SMO algorithm is based on LIBS VM (jChang and Linl . l200ll ) . To circumvent possible 
numerical instability, we added small positive constant 10 -6 to the diagonals of the matrix 
Q. In all the experiments, we used the Gaussian kernel 



K(x, x') = exp 



7, 
P 



x 



X 



J\\2 



(25) 



where 7 is a hyper-parameter and p is the dimensionality of x. 



4.1 Illustrative Example 

First, we illustrate the behavior of the proposed path- following algorithm using an artificial 
data set. Consider a binary classification problem with the training set {(a?i, yi)}f = i, where 
X{ £ M 2 and yi € {— 1,+1}. Let us define the sets of indices of positive and negative 
instances as K,^\ = {i\yi = —1} and K. + \ = {i\yi = +1}, respectively. We assume that the 
loss function is defined as 



J2vii(yif(xi) <o), 



(26) 



where v% £ {1,2} is the cost of misclassifying the instance (xi,yi), and /(•) is the indicator 
function. Let T>\ = {i\vi = 1} and T>2 = {i\vi = 2}, i.e., T>2 is the set of instance indices 
which have stronger influence on the overall test error than T>\. 

To be consistent with the above error metric, it would be natural to assign a smaller 
weight C\ for i S Pi and a larger weight C2 for i € T>i when training SVM. However, 
naively setting C2 = 1C\ is not generally optimal because the hinge loss is used in SVM 
training, while the 0-1 loss is used in performance evaluation (see (|26p ). In the following 
experiments, we fixed the Gaussian kernel width to 7 = 1 and the instance weight for T>2 
to C2 = 10, and we changed the instance weight C\ for T>\ from to 10. Thus, the change 
of the weights is represented as 



>ld) 

.(old) 
L »2 J 





10 



and 



.(new) 

(new) 
L X>2 



10 
10 
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The two-dimensional input {sj}" =1 were generated from the following distribution: 




(27) 



Figure H] shows the generated instances for n = 400, in which instances in the above four 
cases have the equal size n/4. Before feeding the generated instances into algorithms, we 
normalized the inputs in [0, l] 2 . 

Figure [5] shows piecewise-linear paths of some of the solutions ctj for C\ € [0, 10] when 
n = 400. The left graph includes the solution paths of three representative parameters aj 
for i £ Di, All three parameters increase as C\ grows from zero, and one of the parameters 
(denoted by the dash-dotted line) suddenly drops down to zero at around C\ = 7. Another 
parameter (denoted by the solid line) also sharply drops down at around C\ = 9, and the last 
one (denoted by the dashed line) remains equal to C\ until C\ reaches 10. The right graph 
includes the solution paths of three representative parameters «j for i € T>2, showing that 
their behavior is substantially different from that for T>\. One of the parameters (denoted 
by the dash-dotted line) fluctuates significantly, while the other two parameters (denoted 
by the solid and dashed lines) are more stable and tend to increase as C\ grows. 

An important advantage of the path following algorithm is that the path of the validation 
error can be traced as well (see Figured]). First, note that the path of the validation error 
(26) has piecewise-constant form because the 0-1 loss changes only when the sign of f(x) 
changes. In our path-following algorithm, the path of f(x) also has piecewise-linear form 
because f{x) is linear in their parameters ct and b. Exploiting the piecewise linearity of f(x), 
we can exactly detect the point at which the sign of f{x) changes. These points correspond 
to the breakpoints of the piecewise-constant validation-error path. Figure [6] illustrates the 
relationship between the piecewise-linear path of f(x) and the piecewise-constant validation- 
error path. Figure [7] shows an example of piecewise-constant validation-error path when C\ 
is increased from to 10, indicating that the lowest validation error was achieved at around 
Ci =4. 

Finally, we investigated the computation time when the solution path from C\ = 
to 10 was computed. For comparison, we also investigated the computation time of the 
SMO algorithm when the solutions at every breakpoint were computed. We considered the 
following four cases: the number of training instances was n = 400, 800, 1200, and 1600. 
For each n, we generated 10 data sets and the average and standard deviation over 10 runs 
are reported. Table [T] describes the results, showing that our path-following algorithm is 
faster than the SMO algorithm in one or two orders of magnitude; the difference between 
the two methods becomes more significant as the training data size n grows. 
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Figure 4: Artificial data set generated by the distribution ()27|) . The crosses and circles 
indicate the data points in /C_i (negative class) and /C+i (positive class), respec- 
tively. The left plot shows the data points in T>\ (misclassification cost is 1), the 
middle plot shows the data points in T>2 (misclassification cost is 2), and the right 
plots show their union. 




oti for i 6 T>i on for i £ T>2 



Figure 5: Examples of piecewise-linear paths of «j for the artificial data set. The weights 
are changed from Cj = to 10 for i € T>\ (for i 6 T>2, C% = 10 is unchanged). 
The left and right plots show the paths of three representative parameters aj for 
i € T>i, and for i € T>2, respectively. 



The table also includes the number of events and the average number of elements in the 
margin set M. (see Eq. (I7bp ) . This shows that the number of events increases al most linearly 



i n the s ample size n, wh i ch we ll agr ees with the e mpiri cal results reported in lHastie et al 



(|2004l V iGunter and Zhul d2007l ). and IWang et~all (|2008l V The average number of elements 
in the set M increases very mildly as the sample size n grows. 



4.2 Online Time-series Learning 



In online time-series learning, larger (r esp. smaller) weight s should be assigned to newer 
(resp. older) instances. For example, in Icko and Tav! S) , the following weight function 
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Misclassification 



0/3 1/3 



Figure 6: A schematic illustration of validation-error path. In this plot, the path of misclas- 
sification error rate ^ Yli=l HUif( x i)) — 0) f° r the 3 validation instances (xi,y{), 
(x2,V2), and (^3,2/3) are depicted. The horizontal axis indicates the parameter 8 
and the vertical axis denotes yif(xi), i = 1, 2, 3. The path of the validation error 
has piecewise-constant form because the 0-1 loss changes only when f{xj) = 0. 
The breakpoints of the piecewise-constant validation-error path can be exactly 
detected by exploiting the piecewise linearity of f{x{). 
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O 
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o 
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i 
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Figure 7: An example of the validation-error path for 1000 validation instances of the 
artificial data set. The number of training instances is 400 and the Gaussian 
kernel with 7 = 1 is used. 




Table 1: The experimental results of the artificial data set. The average and the standard 
deviation (in brackets) over 10 runs are reported. 



n 


CPU time (sec.) 


^events 


mean \M\ 




path 


SMO 






400 


0.03 (0.00) 


0.39 (0.01) 


326.70 ( 7.17) 


3.07 (0.03) 


800 


0.08 (0.00) 


2.84 (0.12) 


635.30 (17.47) 


3.27 (0.02) 


1200 


0.19 (0.00) 


10.63 (0.38) 


997.60 (26.85) 


3.38 (0.05) 


1600 


0.35 (0.01) 


28.11 (0.77) 


1424.00 (31.27) 


3.50 (0.02) 
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Figure 8: The weight functions for financial time-series forecasting ( Cao and Tay . 2003I ) . 

The horizontal axis is the index of training instances which is sorted according 
to time (the most recent instance is i = n). If we set a = 0, all the instances are 
weighted equally. 



Ci 



T 



I 



n + 1 i 



Figure 9: A schematic illustration of the change of weights in time-series learning. The 
left plot shows the fact that larger weights are assigned to more recent instances. 
The right plot describes a situation where we receive a new instance (i = n+1). 
In this situation, the oldest instance (i = 1) is deleted by setting its weight to 
zero, the weight of the new instance is set to be the largest, and the weights of 
the rest of the instances are decreased accordingly. 



is used: 



C; — Cn 



1 + exp(a — 2a x 



(28) 



where Co and a are hyper-parameters and the instances are assumed to be sorted along 
the time axis (the most recent instance is i = n). Figure [8] shows the profile of the weight 
function (|28p when Co = 1. In online learning, we need to update parameters when new 
observations arrive, and all the weights must be updated accordingly (see Figure [9|). 

We investigated the computational cost of updating parameters when several new ob- 
servations arrive. The experimental data are obtaine d from the NASDAQ co mposite index 
between January 2, 2001 and December 31, 2009. As ICao and Tavl ( 20031 ) and IChen et al 
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Table 2: Features for financial forecasting (p(i) is the closing price of the ith day and 
EMAfc(i) is the /c-day exponential moving average of the ith day. 



Feature 


Formula 


EM A 15 


p(*)-EMAi5(i) 




RDP-5 


(p(i) -P(i ~ 5))/p(i - 


- 5) * 100 


RDP-10 


(p(i) -p(i - W))/p(i 


- 10) * 100 


RDP-15 


(p(i) -p(i - l5))/p(i 


- 15) * 100 


RDP-20 


(p(i) -p(i- 20)) /p(i 


- 20) * 100 


RDP+5 


(p(i + h)-p(i))/p(i) 


* 100 




p(i) = EMA 3 (z) 





---SMO 




— Path 












■--SMO 




— Path 









(a) 7 = 10 



(b) 7 = 1 



(c) 7 = 0.1 



Figure 10: CPU time comparison for online time-series learning using NASDAQ composite 
index. 



(120061 ). we transformed the original closing prices using the Relative Difference in Percentage 
(RDP) of the price and the exponential moving average (EM A). 

Extracted features are listed in Table [5] (see ICao and Tavl . I2003L for more details). Our 
task is to predict the sign of RDP+5 using EMA15 and four lagged- RDP values (RDP-5, 
RDP-10, RDP-15, and RDP-20). RDP values which exceed ±2 standard deviations are 
replaced with the closest marginal values. We have an initial set of training instances with 
size n = 2515. The inputs were normalized in [0, l] p , where p is the dimensionality of the 
input x. We used the Gaussian kernel (j25[) with 7 € {10, 1, 0.1}, and the weight parameter 
a in (|28p was set to 3. We first trained WSVM using the initial set of instances. Then we 
added 5 instances to the previously trained WSVM and removed the oldest 5 instances by 
decreasing their weights to 0. This does not change the size of the training data set, but 
the entire weights need to be updated as illustrated in Figure M We iterated this process 5 
times and compared the total computational costs of the path-following algorithm and the 
SMO algorithm. For fair comparison, we cleared the cache of kernel values at each update 
before running the algorithms. 

Figure [10] shows the average CPU time over 10 runs for Co £ {1, 10, 10 2 , 10 3 , 10 4 }, 
showing that the path-following algorithm is much faster than the SMO algorithm especially 
for large Cq. 
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4.3 Model Selection in Covariate Shift Adaptation 

Covariate shift is a situation in supervised learning where the input distributions change 
between the training and test phases bu t the conditional distribution of outputs given inputs 
remains unchanged ( Shimodaira . 2000l ). Under covariate shift, standard SVM and SVR are 



biased, and the bias caused by covariate shift can be asymptotically canceled by weighting 
the loss function according to the importance (i.e., the ratio of training and test input 
densities) . 

Here, we apply i mportance-weighted SVMs to brain- computer interfaces (BCIs) 



Here, we apply i mportance-weignted bVMs to brain- computer interfaces [dLjLs) 
(jDornhege et all . 120071 ^ . A BCI is a system which allows for a direct communication from 



man to machine via brain signals. Strong non-stationarity effects have been often observed 
in br ain signals between tr aining and test sessions, which could be modeled as covariate 



shift (ISugivama et al. . 120071) . We used the BCI datasets provided by the Berlin BCI group 



( Burde and Blankertd . 20061 ). containing 24 binary classification tasks. The input features 



are 4-dimensional preprocessed electroencephalogram (EEG) signals, and the output labels 
correspond to the 'left' and 'right' commands. The size of training datasets is around 500 
to 1000, and the size of test datasets is around 200 to 300. 

Although the importance- weighted SVM t ends to have lower bias, it in turns has larger 
estimation variance than the ordinary SVM ( Shimodaira . 2000l ) . Thus, in practice, it is 



desirable to slightly 'flatten' the instance weights so that the trade-off between bias and 
variance is optimally controlled. Here, we changed the instance weights from the uniform 
values to the importance values using the proposed path- following algorithm, i.e., the in- 
stance weights were changed from cj old ^ = Co to C^ 1CW ^ = Co Pte3t ^\ i = 1 . ,n. The 



importance values Pteat ^'\ were estimated by the method proposed in iKanamori et al 

'Ptrain i ) 



1 ft rain K^zJ 

( 2009 ) . which directly estimates the density ratio without going through density estimation 



Of Ptest(x) and p tr ain(x). 

For comparison, we ran the SMO algorithm at (i) each breakpoint of the solution path, 
and (ii) 100 weight vectors taken uniformly in [C^° ld ' ) , C^ ncw ^]. We used the Gaussian kernel 
and the inputs were normalized in [0, l] p , where p is the dimensionality of x. 

Figure [TT] shows the average CPU time and its standard deviation. We examined several 
settings of hyper-parameters 7 6 {10, 1, . . . , 10 -2 } and Co € {1, 10, 10 2 , . . . , 10 4 }. The hor- 
izontal axis of each plot represents Co. The graphs show that our path- following algorithm 
is faster than the SMO algorithm in all cases. While the SMO algorithm tended to take 
longer time for large Co, the CPU time of the path- following algorithm did not increase 
with respect to Cq. 



5. Beyond Classification 

So far, we focused on classification scenarios. Here we show that the proposed path-following 
algorithm can be extended to various scenarios including regression, ranking, and transduc- 
tion. 
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Figure 11: CPU time comparison for covariate shift adaptation using BCI data. 



5.1 Regression 

The support vector regression (SVR) i s a variant of SVM fo r regression problems 
dVapnik et all Il996l : iMattera and Havkinl . Il999l : iMiiller et all Il999r i . 



5.1.1 Formulation 

The primal optimization problem for the weighted SVR (WSVR) is defined by 



mm 

s.t. 



1 

1=1 

Vi - f(xi) < e + 
f(xi)- yi <e + &, 

> 0, i = l,...,n, 



where e > is an insensitive-zone thickness. Note that, as in the classification case, WSVR 
is reduced to the original SVR when Cj = C for i = 1, . . . , n. Thus, WSVR includes SVR 
as a special case. 

The corresponding dual problem is given by 

^ n n n n 

max - o atiQLjK(xi, Xj) - \ai\ + ^ yjOj 

{ ai ti=i 1 i=1 i= i j =1 i=1 

n 

s.t. ai = 0, — Ci < ai < Ci, i = 1, . . . , n. 

The final solution, i.e., the regression function / : X — > R, is in the following form: 

n 

f{x) = ^2 <^iK(x, Xi) + b. 
i=i 
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The KKT conditions for the above dual problem are given as 



\Vi ~ f(xi)\ < e, if ati = 0, 

\Vi ~ f{xi)\ = e, if 0<|ai|<Ci, 

\y% ~ f(x.i)\ > e, if \ai\ = Q, 



j^ai = o. 



(29a) 
(29b) 
(29c) 

(29d) 



i=l 



Then the training instances can be partitioned into the following three index sets (see 
Figure [121): 



O 
£ 
1 



{> 
{> 



\Vi ~ f(xi)\ > e, \ai\ = Ci}, 
\Vi ~ f{xi)\ = e,0 < | Oil < Ci}, 
\y% - f( x i)\ <£,ai = 0}. 



Let 



K 



l 1 

1 K £ 



and s 



sign(yi - /(sri)) 
sign(y n - f{x n )) 



Then, from (j29H . we obtain 



£\-l 



b 

olq = diag(so)co, 



i5 



co + (K b ) 



£\-l 







where diag(so) indicates the diagonal matrix with its diagonal part so- These functions 
are affine w.r.t. c, so we can easily detect an event point by monitoring the inequalities in 
(I29D . We can follow the solution path of SVR by using essentially the same technique as 
SVM classification (and thus the details are omitted). 
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5.1.2 Experiments on Regression 

As an application of WSVR, we consider a heteroscedastic regression problem, where out- 
put noise variance depends on input points. In heteroscedastic data modeling, larger 
(resp. smaller) weights are usually assigned to instances with smaller (resp. larger) vari- 
ances. Since the point-wise variances are often unknown in practice, they should also be 
estimated from data. A standard approach is to alternately estimate the weight vector c 
based on the cu rrent WSVR solution and update the WSVR solutions based on the new 



weight vector c (jKersting et al.l . 120071 ) . 



We set the weights as 



Ci = C ^-, (30) 



where = y% — f(xi) is the residual of the instance (xi, yi) from the current fit f(xi), and a 



is an estimate of the common standard deviation of the noise computed as a = y ^ Y^=i e 1- 
We employed the following procedure for the heteroscedastic data modeling: 

Stepl: Training WSVR with uniform weights (i.e., Ci = Cq, i = 1, . . . , n.). 



Step2: Update weights by ([30]) and update the solution of WSVR accordingly. Repeat 

this step until ^ Ya=i l( e i° ld ^ ~~ e i)/ e i° ld \ ^ 10~ 3 holds, where e( old ) is the previous 
training error. 

We investigated the computational cost of Step2. We applied the above procedure to 
the well-known Boston housing data set. The sample size is 506 and the number of features 
is p = 13. The inputs were normalized in [—1, l] p . We randomly sampled n = 404 instances 
from the original data set, and the experiments were repeated 10 times. We used the 
Gaussian kernel (|25j) with 7 € {10,1,0.1}. The insensitive zone thickness in WSVR was 
fixed to e = 0.05. 

Each plot of Figure [131 shows the CPU time comparison for Cq € {1, 10, ... , 10 4 }, and 
Figure [14] shows the number of iterations performed in Step2. Our path-following approach 
is faster than the SMO algorithm especially for large Cq. 

5.2 Ranking 

Recently, the problem of learning to rank has at tracted wid e interest as a challenging topic 
in machine learning and informat ion retrieval ( Liu ._ 20091 ). Here, we focus on a method 



called the ranking SVM (RSVM) (jrlerbrich et all 1200(1 ) 



5.2.1 Formulation 

Assume that we have a set of n triplets {(xi,yi, qi)}f =1 where X{ £ W is a feature vector of 
an item and € {ri, . . . , r q } is a relevance of X{ to a query qi. The relevance has an order 
of the preference r q y r q ^\ >~ ■ ■ ■ y r±, where r q y r 9 _i means that r q is preferred to r q -\. 
The goal is to learn a ranking function f(x) which returns a larger value for a preferred 
item. More precisely, for items Xi and Xj such that qt = qj, we want the ranking function 
f(x) to satisfy 

Hi >- Vj f(xi) > f(xj). 
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(a) 7 = 10 (b) 7 = 1 (c) 7 = 0.1 

Figure 13: CPU time comparison for heteroscedastic modeling using Boston housing data. 
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Figure 14: The number of weight updates for Boston housing data. 



Let us define the following set of pairs: 

*P = {(hj) I Hi >■ Vj,Qi = Qj} 
RSVM solves the following optimization problem: 

1„ 



mm 



w 



s.t. f(xt) - f(xj) > 1 - fo, e V. 



In practical ranking tasks such as information retrieval, a pair of items with highly 
different preference levels should have a larger weight than those with simil ar preference 
levels. Based on this prior knowledge, ICao et al.1 (j200fih and IXu et al.1 (j200fih proposed to 
assign different weights Cij to different relevance pairs € V. This is a cost-sensitive 
variant of RSVM whose primal problem is given as 



mm 



s.t. f(xi)-f(xj) >l-tij, (i,j)eV. 



Since this formulation is interpreted as a WSVM for pairs of items (i, j) € V, we can easily 
apply our multi-parametric path approach. Note that the solution path algorithm for the 
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cost-s ensitive RSVM is regarded as an extension of the previous work by lArreola et al 
( 20081 ) . in which the solution path for the standard RSVM was studied. 



In this paper, we consider a model selection problem for the weighting pattern 
{Cy}(ij)e"P- We assume that the weighting pattern is represented as 



Cij = cf d) + 9(C% CW) - c£ ld) ), eV,6€ [0, 1], (31) 



where 



C\f d) = Co, (iJ)eV, (32) 
c (new) = {2 y t _ 2 y ])Coj {i j )eVj (33) 

and Co is the common regularization parameteiH. We follow the multi-parametric solution 
path from {C^- }^j^ e -p to {C^ ncw ^}(j j) e -p and the best 6 is selected based on the validation 
performance. 

The performance of ranking algorithms is usually evaluated by some information- 
retrieval measures such as the normalized discounted cumulative gain (NDCG) 
( Jarvelin and Kekalalnen . 2000l ). Consider a query q and define q(j) as the index of the 



j-th largest item among {f{xi)} i& ^ q . =q y The NDCG at position k for a query q is defined 



as 

*L ( 2 v i(J) - 1 7 = 1 

NDCG@A; = ZV^ 2»,cfl-i ' (34) 

Pi { TiUT' J > lj 

where Z is a constant to normalize the NDCG in [0, 1]. Note that the NDCG value in (|34p 
is defined using only the top k items and the rest are ignored. The NDCG for multiple 
queries are defined as the average of jM|)- 

The goal of our model selection problem is to choose 6 with the largest NDCG value. 
As explained below, we can identify 9 that attains the exact maximum NDCG value for 
validation samples by exploiting the piecewise linearity of the solution path. The NDCG 
value changes only when there is a change in the top k ranking, and the rank of two items 
Xi and Xj changes only when f(xi) and f{xj) cross. Then change points of the NDCG can 
be exactly identified because f(x) changes in piecewise- linear form. Figure [T5l schematically 
illustrates piecewise-linear paths and the corresponding NDCG path for validation samples. 
The validation NDCG changes in piecewise-constant form, and change points are found 
when there is a crossing between two piecewise-linear paths. 

5.2.2 Experiments on ranking 

We used the QHSUMED data set from the LETOR package (version 3.0) provided by 
Microsoft Research Asia ( Liu et al. . 20071 ) . We used the query-level normalized version of 



the data set containing 106 queries. The total number of query-document pairs is 16140, 



2. In lChapelle and Kocrthi (2010), ranking of each item is also incorporated to define the weighting pattern. 



However, these weights depend on the current ranking, and it might change during training. We thus, 
for simplicity, introduce the weighting pattern (|33p that depends only on the difference of the preference 
levels. 
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Figure 15: The schematic illustration of the NDCG path. The upper plot shows outputs for 
3 items which have different levels of preferences y. The bottom plot shows the 
changes of the NDCG. Since the NDCG depends on the sorted order of items, 
it changes only when two lines of the upper plot intersect. 



and the number of features is p = 45. The data set provided is originally partitioned into 
5 subsets, each of which has training, validation, and test sets for 5-fold cross validation. 
Here, we only used the training and the validation sets. 

We compared the CPU time of our path algorithm and the SMO algorithm to change 
{Cij}(i,j)€V from flat ones (|32|) to relevance weighted ones l[33|) . We need to modify the 
SMO algorithm to train the model without the explicit bias term b. The usual SMO 
algorithm updates selected two parameters per iteration to ensure that the solution satisfies 
the equality constraint derived from the optimality condition of b. Since RSVM has n o bias 
term, the algorithm is adapt ed to updat e one parameter per iteration ijVogd . l2002h . We 
employed the update rule of IVogtl (|2002l ) to adapt the SMO algorithm to RSVM and we 
chose the maximum violating point as an update para meter. This st r ategy is analogous 
to the maximum violating-pair working set selection of iKeerthi et al.l (|200ll ) in ordinary 
SVM. Since it took relatively large computational time, we ran the SMO algorithm only 
at 10 points uniformly taken in [C 4 - c 



(old) p(new) 



•J 



We considered every pair of initial weight 
C e {10 -5 , . . . , KT 1 } and Gaussian width 7 G {10, 1, 0.1}. The results, given in Figure [TBI 
show that the path algorithm is faster than the SMO in all of the settings. 



The CPU time of the path algorithm in Figure 16(a) increases as Co increases because the 
number of breakpoints and the size of the set M also increase. Since our path algorithm 
solves a linear system with size \M\ using 0(|.M| 2 ) update in each iteration, practical 
computational time depends on \ A4\ especially in large data sets. In the case of RSVM, the 
maximum value of is the number of pairs of training documents m = \V\. For each fold 
of the OHSUMED data set, m = 367663, 422716, 378087, 295814, and 283484. If \M\ « m, 
a large computational cost may be needed for updating the linear system. However, as 
Figure [T71 shows, the size \A4\ is at most about one hundred in this setup. 

Figure H8] shows the example of the path of validation NDCG@10. Since the NDCG 
depends on the sorted order of documents, direct optimization is rather difficult (|Liul 120091 k 
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Figure 16: CPU time comparison for RSVM. 
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Figure 17: The number of instances on the margin \A4\ in ranking experiment. 
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Figure 18: The change of NDCG@10 for 7 = 0.1 and C = 0.01. The parameter 6 in the 
horizontal axis is used as c( old ) + 6>(c( ncw ) - c^). 
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Using our path algorithm, however, we can detect the exact behavior of the NDCG by 
monitoring the change of scores f(x) in the validation data set. Then we can find the best 
weighting pattern by choosing 9 with the maximum NDCG for the validation set. 



5.3 Transduction 



In transductive inference (IVapnikl . ll995l ). we are given unlabeled instances along with labeled 
instances. The goal of transductive inference is not to estimate the true decision function, 
but to classify t he given unlabeled instances correctly. The transductive SVM (TSVM) 
djoachimsl . fi999l ) is one of the most popular approaches to transductive binary classification. 
The objective of the TSVM is to maximize the classification margin for both labeled and 
unlabeled instances. 



mm 



K,«*}Li>™Afe}? =1 



s.t. 



5.3.1 Formulation 

Suppose we have k unlabeled instances {x*}^ =1 in addition to n labeled instances 
{(a;i,yi)}f = l- The optimization problem of TSVM is formulated as 

1 n k 

2MI! + CE& + C *E^ (35) 
i=\ j=i 

yi (w T ^( Xi ) + b)>l-Ci, i 
y*(w T &( Xj ) + b) >!-£*, j 
6 > 0, i 

£* > o, j 

where C and C* are the regularization parameters for labeled and unlabeled data, respec- 
tively, and y* € {—1, +1}, j = 1, . . . are the labels of the unlabeled instances {a;*}^. 
Note that (|35|) is a combinatorial optimization problem with respect to {y*- Ijgji,. The 
optimal solution of (|35p can be found if we solve binary SVMs for all possible combinations 
of h, but thi s is co mputationally intractable even for moderate k. To cope with 

this problem, Joachims ( 19991 ) proposed an algorithm which approximately optimizes (|35|) 
by solving a series of WSVMs. The subproblem is formulated by assigning temporarily 
estimated labels y- to unlabeled instances: 



,n, 
, k, 
,n, 
, k, 



mm 



w 



\l + c 



n 

4=1 j&{j\y 



-i} 



s.t. 



yi (w T $(xi) + b) > 

y*(w T 3>( Xj ) + b) >i-q, 

^ > 0, 

q > o, 

where C* and are the weights for unlabeled instances for jj \ y*- 



1, . . . ,n, 
1, . . . , k, 
l,...,n, 
1 k, 



Hand {j | y* 



+1}, respectively. The entire algorithm is given as follows (see iJoachimsl . Il999l . for details): 
Stepl: Set the parameters C, C*, and k + , where k + is defined as 

10' I yj = + l J = i, •••,«}! 



k x 



n 
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k + is defined so that the balance of positive and negative instances in the labeled set 
is equal to that in the unlabeled set. 

Step2: Optimize the decision function using only the labeled instances and compute the 
decision function values {f(x*)}j =1 . Assign positive label y* = 1 to the top k + 
unlabeled instances in decreasing order of f(x*), and negative l abel = — 1 to the 
remaining instances. Set C*_ and C+ to some small values (see Joachims] . 19991 . for 
details). 

Step3: Train SVM using all the instances (i.e., solve (|36p ). Switch the labels of a pair 
of positive and negative unlabeled instances if the objective value (1351) is reduced , 
where the pair of instances are selected based on {C|}je{i,...,ifc} ( see Joachims] . 19991 . 



for details). Iterate this step until no data pair decreases the objective value. 

Step4: Set C*_ = min(2C*,C*) and C% = min(2C|,C*). If C* > C* and C* + > C*, 
terminate the algorithm. Otherwise return to Step3. 

Our path-following algorithm can be applied to Step3 and Step4 for improving compu- 
tational efficiency. Step3 can be carried out via path-following as follows: 

Step3(a) Choose a pair of positive instance x* m and negative instance x* 

Step3(b) After removing the positive instance x* m by decreasing its weight parameter C m 
from to 0, add the instance x* m as a negative one by increasing C m from to C*_. 

Step3(c) After removing the negative instance x* , by decreasing its weight parameter 
C m ' from C*_ to 0, add the instance x* , as a positive one by increasing C m > from 
to C* + . 

Note that the steps 3(b) and 3(c) for switching the labels may be merged into a single step. 
Step4 also can be carried out by our path-following algorithm. 

5.3.2 Experiments on Transduction 

We compare the computation time of the proposed path-following algorithm and the SMO 
algorithm for Step3 and Step 4 of TSVM. We used the spam data set obtained from the U CI 
machine learning repository ( Asuncion and Newman . 20071 ). The sample size is 4601, and 



the number of features is p = 57. We randomly selected 10% of data set as labeled instances, 
and the remaining 90% were used as unlabeled instances. The inputs were normalized in 
[0, If. 

Figure [19] shows the average CPU time and its standard deviation over 10 runs for the 
Gaussian width 7 G {10, 1, 0.1} and C € {1, 10, 10 2 , . . . , 10 4 }. The figure shows that our 
algorithm is consistently faster than the SMO algorithm in all of these settings. 

6. Discussion and Conclusion 

In this paper, we developed an efficient algorithm for updating solutions of instance- weighted 
SVMs. Our algorithm was built upon multiple parametric programming techniques, and it is 
an extension of existing single-parameter path- following algorithms to multiple parameters. 
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Figure 19: CPU time comparison for the transductive SVM. 



We experimentally demonstrated the computational advantage of the proposed algorithm 
on a wide range of applications including on-line time-series analysis, heteroscedastic data 
modeling, covariate shift adaptation, ranking, and transduction. 

Another important advantage of the proposed approach beyond computational efficiency 
is that the exact solution path is represented in piecewise linear form. In SVM (and its 
variants), the decision function f(x) also has a piecewise linear form because f(x) is linear 
in the parameters a. and b. It enables us to compute the entire path of the validation errors 
and to select the model with the minimum validation error. Let V be the validation set 
and the validation loss is defined as Yliev ^{Uh f( x i))- Suppose that the current weight is 
expressed as c + Od in a critical region TZ using a parameter 8 € R, and the output has the 
form f(xi) = aft + b{ for some scalar constants a, and 6j. Then the minimum validation 
error in the current critical region TZ can be identified by solving the following optimization 
problem: 



min > £(yi,ai9 + bA s.t. c + 9d e TZ. 



(37) 



iev 



After following the entire solution-path, the best model can be selected among the candi- 
dates in the finite number of critical regions. In the case of 0-1 loss, i.e., 



£(y i ,f(x i )) = I{y iS ga(f(x i )) = -l}, 

the problem (|37p can be solved by monitoring all the points at which f{xj) 
Figure EJ). Furthermore, if the validation error is measured by squared loss 

£(yi,f(x i )) = (y i -f(x i )) 2 , 



(see 



the problem (I37p can be analytically solved in each critical region. As another interesting 
example, we described how to find the maximum validation NDCG in ranking problem (see 
Section [5.2p . In the case of NDCG, the problem (|37h can be solved by monitoring all the 
intersections of f(xi) and f(xj) such that yi ^ yj (see Figure [T5]M 

Although we focused only on quadratic programming (QP) machines in this paper, 
similar algorithms can be developed for linear programming (LP) machines. It is well-known 



3. In NDCG case, the "min" is replaced with "max" in the optimization problem ([3^ 
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in the parametric programming literature that the solution of LP and QP have piecewise 
linear form if a linear function of hyper-parameters appears in the constant term of the 
constraints and/or the linear part o f the objective function (see Ritter . 1984 : Best . 19821 : 
Gall. 19951: Pistikopoulos et al,l . 120071 . for more details). Indeed, the parame tric LP technique 
( Gall" 1995) has already been app l ied to several machine learning problems ( Zhu et al. . 2004 : 
Yao and Lee . 2007 : Li and Zhu . 20081 ). One of our future works is to apply the multi- 
parametric approach to these LP machines. 

In this paper, we studied the changes of instance-weights of various types of SVMs. 
There are many other situations in which a path of multiple hyper-parameters can be ex- 
ploited. For instance, the application of the multi-parametric path approach to the following 
problems would be interesting future works: 



Different (functional) margin SVM: 

1„ 



mm — \\w 



i=l 



s.t. yif(xi) >di-£i, & > 0, i = l,...,n, 

where 5{ € M is a margin rescaling parameter. Chapelle and Keerthi ( 20ld ) indicated 
that this type of parametrization can be used to give different costs to each pair of 
items in ranking SVM. 

SVR with different insensitive-zone thickness. Although usual SVR has the common 
insensitive-zone thickness e for all instances, different thickness for every instance 
can be assigned: 



1 



mm 



n 



i J i— 1 

s.t. 



i=i 

Vi ~ f(xi) < + 
f(xi) -Vi< 



,n. 



In the case of common thicknes s, the optimal e is known to be asymptotic ally pro- 
portional to the noise variance (jSmola et all Il998l : iKwok and Tsanel . 120031 ). In the 
case of heteroscedastic noise, it would be reasonable to set different £j, each of which 
is proportional to the variance of each (iteratively estimated) yi. 

• The weight e d las so. The lasso solves the following quadratic programming problem 
(|Tibshiranil . ll996h : 



p p 
mm||y-^a;^-||! + A^|£ 



where A is t he regulariz ation parameter. A weighted version of the lasso has been 
considered in IZoT (|200fil ): 



p p 
P j= i j= i 
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where Wj is an individual weight parameters for each /3j. The weights are adaptively 
determined by wj = |/3j|~ 7 , where j3j is an initial estimation of (5j and 7 > is a 
parame ter. A similar weigh ted parameter-minimization problem has also been consid- 
ered in ICandes et al.l (]2008l ) in the context of signal reconstruction. They considered 
the following weighted £i-minimization probleirQ: 



min y wAxA 
xeR n ^— ' 
i=l 

s.t. y = <&x. 

where y £ R m , 3? € R mxn , and m < n. The goal of this problem is to reconstruct a 
sparse signal x from the measurement vector y and sensing matrix The constraint 
linear equations have infinitely many solutions and the simplest explanation of y is 
desirable. To estimate better sparse representation, they proposed an iteratively re- 
weighting strategy for estimating W{. 

In order to apply the multi-parametric approach to these problems, we need to determine 
the search direction of the path in the multi-dimensional hyper-parameter space. In many 
situations search directions can be estimated from data. 

2003 : 



2001; Martin 



Incremental- decremented SVM (ICauwenberghs and Poggiol. 
Ma and Theiler . 2003 : Laskov et al. . 20061 : Karasuyama and Takeuchi . 20091 ) exploits the 
piecewise linearity of the solutions. It updates SVM solutions efficiently when instances are 
added or removed from the training set. The incremental and decremental operations can 
be implemented using our instance-weight path approach. If we want to add an instance 
(xi,yi), we increase Cj from to some specified value. Conversely, if we want to remove 
an instance (xj,yj), we decrease Cj to 0. The paths generated by these two approaches 
are different in general. The instance-weight path keeps the optimality of all the instances 
including currently adding and/or removing ones. On the other hand, the incremental- 
decremental algorithm does not satisfy the optimality of adding and/or removing ones until 
the algorithm terminates. When we need to guarantee the optimality at intermediate points 
on the path, our approach is more useful. 

In the parametric programming approach, numerical instabilities sometimes cause com- 
putational difficulty. In practical implementation, we usually update several quantities such 
as a, b, yf(xi), and L from the previous values without calculating them from scratch. 
However, the rounding error may be accumulated at every breakpoints. We can avoid such 
accumulation using the refresh strategy: re-calculating variables from scratch, for exam- 
ple, every 100 steps (note that, in this case, we need 0(n\Ai\ 2 ) computations in every 100 
steps). Fortunately, such numerical instabilities rarely occurred in our experiments, and 
the accuracy of the KKT conditions of the solutions were kept high enough. 

Anoth er (but relat e d) nu merical difficulty arises when the matrix M in (117p is close to 
singular. I Wang et al. (|2008h pointed out that if the matrix is singular, the update is no 
longer unique. To the best of our knowledge, this degeneracy problem is not fully solved in 
path-following literature. Many heuristics are proposed to circumvent the problem, and we 
used one of them in the experiments: adding small positive constant to the diagonal elements 



4. Note that x and y in the above equation have different meanings from other parts of this paper. 
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of kernel matrix. Other strategies are also discussed in Wu et al. (120081 1. iGartner et al 
(120091 1. and lOng et all (j2O10h . 



Scalability of our algorithm depends on the size of M because a linear system with \ A4\ 
unknowns must be solved at each breakpoint. Although we can update the Cholesky factor 
by 0(|.M| 2 ) cost from the previous one, iterative methods such as conjugate-gradients may 
be more efficient than the direct matrix update when \ A4 \ is fairly large. When is small 
the parametric programming approach can be applied to relatively large data sets such as 
more than tens of thousands of instances (see, for example, the ranking experiments in 
Section 15.21) . 
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